Packages

library(factoextra)
library(pheatmap)
library(tidyverse)
theme_set(theme_bw())

Data

We usually think of high-dimensional data as consisting of multiple measures on a group of samples:

  • Number of “reads” of different bacterial proteins from a set of soil samples
  • Decathlon scores from different competitors
  • Health measures from different children

Types of measures

Many scientists traditionally think of high-dimensional data as having parallel, continuous measures:

  • read matrices from soil samples

These may be complemented by a smaller number of “metadata” variables, which may be more diverse in type (count, categorical, etc.):

  • environmental variables associated with soil samples

More and more datasets don’t follow this:

  • The Canadian Longitudinal Study on Aging has a huge number of variables per person, with a wide mixture of types

Duality

We study the rows (samples) using the columns (measures)

  • What do the observed proteins tell us about the functional relationships between different soil samples?
  • What does differential success in decathlon events tell us about the athletes?

But we can also do the opposite!

  • What do measurements across soil samples tell us about the functional relationships between proteins?
  • What does differential success of athletes tell us about the relationship between events?
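This duality is just matrix transposition: analyzing the columns of a data matrix is the same operation as analyzing the rows of its transpose. A minimal sketch (toy data, not the decathlon set):

```r
## toy data: 5 samples (rows) x 3 measures (columns)
set.seed(101)
X <- matrix(rnorm(15), nrow = 5,
            dimnames = list(paste0("samp", 1:5), paste0("var", 1:3)))
cor(X)     ## 3x3: relationships among measures
cor(t(X))  ## 5x5: relationships among samples
```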

Decathlon events

print(names(decathlon2))  ## from factoextra pkg
##  [1] "X100m"        "Long.jump"    "Shot.put"     "High.jump"   
##  [5] "X400m"        "X110m.hurdle" "Discus"       "Pole.vault"  
##  [9] "Javeline"     "X1500m"       "Rank"         "Points"      
## [13] "Competition"

Decathlon frame

dec_frame <- (decathlon2[1:10]
    %>% as_tibble()
    %>% rename_all(sub, pattern="^X", replacement="Run_")
    %>% mutate_at(vars(contains('Run_')), ~(-1*.))
    %>% rename_all(sub, pattern="110m.", replacement="")
    %>% mutate_all(~c(drop(scale(.))))
)

Pairs plot

Iris pairs plot (iris.R)

  • Slow, and not very good for more than 5 or 6 variables
  • Base R may be faster for quick viz
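A minimal base-R version (a sketch only; iris.R itself is not shown here):

```r
## base-R scatterplot matrix of the four iris measurements,
## coloured by species; quick, but unwieldy beyond 5-6 variables
pairs(iris[1:4], col = iris$Species)
```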

Heatmap

dec_mat <- as.matrix(dec_frame)
heatmap(dec_mat)

Heatmap again

heatmap(t(dec_mat))

Correlation heatmap

pheatmap(cor(dec_mat), cellwidth = 10, cellheight = 10)

Better for visualizing groups of events

Athlete correlations

pheatmap(cor(t(dec_mat)), cellwidth = 4, cellheight = 4)

Better for visualizing groups of athletes

PCA

A beautiful decomposition based on the idea that data points are points in a Euclidean space

  • Need to think about scaling

We can think about the PCA as a decomposition (making observed points from idealized points)

  • And relax it by requiring non-negative combinations of non-negative components (NMF)

Or we can think about it as minimizing distances:

  • And relax it with non-Euclidean distances (PCoA)
  • … or a rank-based approach (NMDS)
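To see the "decomposition" view concretely: the centered, scaled data matrix is exactly the scores times the transposed rotation, and keeping only the first k components gives a best rank-k approximation. A sketch using a generic built-in data set (mtcars), not the decathlon data:

```r
X <- scale(as.matrix(mtcars[, 1:5]))  ## centre and scale first
p <- prcomp(X)
## full reconstruction: scores %*% t(rotation) recovers X
Xhat <- p$x %*% t(p$rotation)
max(abs(X - Xhat))  ## ~0, up to numerical noise
## rank-2 approximation from the first two components
X2 <- p$x[, 1:2] %*% t(p$rotation[, 1:2])
```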

PCA decomposition

pca_ath <- prcomp(dec_mat, scale=TRUE)
fviz_screeplot(pca_ath)

Individuals

fviz_pca_ind(pca_ath)

Components

fviz_pca_var(pca_ath)

fviz_pca_var(pca_ath, axes=c(2, 3))

Biplot

View scores and components:

fviz_pca_biplot(pca_ath)

Loadings

load <- pca_ath$sdev * pca_ath$rotation
pheatmap(load, cluster_cols=FALSE)

WRONG!!!! (sdev is recycled down the rows, so this scales the rows of the rotation matrix rather than its columns)

Loadings (corrected)

Each column of the rotation matrix should be scaled by the corresponding standard deviation:

load <- sweep(pca_ath$rotation, 2, pca_ath$sdev, "*")
pheatmap(load, cluster_cols=FALSE)

Non-metric dimensional scaling

library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6
mds <- metaMDS(dec_mat, distance="euclidean")
## 'comm' has negative data: 'autotransform', 'noshare' and 'wascores' set to FALSE
## Run 0 stress 0.175356 
## Run 1 stress 0.175356 
## ... Procrustes: rmse 3.75291e-06  max resid 1.004819e-05 
## ... Similar to previous best
## Run 2 stress 0.2029784 
## Run 3 stress 0.175356 
## ... Procrustes: rmse 1.355825e-05  max resid 3.876792e-05 
## ... Similar to previous best
## Run 4 stress 0.2028176 
## Run 5 stress 0.175356 
## ... Procrustes: rmse 4.319605e-05  max resid 0.0001589884 
## ... Similar to previous best
## Run 6 stress 0.1975353 
## Run 7 stress 0.2035142 
## Run 8 stress 0.1787377 
## Run 9 stress 0.2020938 
## Run 10 stress 0.1787376 
## Run 11 stress 0.2146083 
## Run 12 stress 0.2035077 
## Run 13 stress 0.1783895 
## Run 14 stress 0.1975374 
## Run 15 stress 0.1754619 
## ... Procrustes: rmse 0.02261613  max resid 0.08184932 
## Run 16 stress 0.1950854 
## Run 17 stress 0.2028353 
## Run 18 stress 0.1950854 
## Run 19 stress 0.1788135 
## Run 20 stress 0.1787378 
## *** Solution reached

Nmds plot

plot(mds)
## species scores not available

The biplot is hard (but we can help you if you want to do it)

Parallel co-ordinates

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa

ggparcoord(dec_frame)

Dual view

library(directlabels)
ath_frame <- (
    as.data.frame(t(as.matrix(dec_frame)))
    %>% mutate(event=names(dec_frame))
)
ggparcoord(ath_frame)

ggparcoord(ath_frame, groupColumn="event")

Andrews plots

library(andrews)
## andrews::numarray() does not like tibbles
andrews(as.data.frame(dec_frame))

andrews(ath_frame)

Radar charts

Not implemented; may suffer similar problems

Ordering

Many of these charts might work if the variables were in a suitable order:

  • heatmap provides an order
    • hierarchical
  • other ordering methods take a more holistic view
  • PCA and other decompositions could also provide an order!
    • Should we combine PCA with Andrews or parcoord?
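One way to try the PCA-ordering idea (a sketch, not from the slides): reorder the variables by their PC1 loadings before drawing the parallel-coordinates plot, so that similarly-loading variables end up adjacent:

```r
library(GGally)  ## for ggparcoord()
X <- as.data.frame(scale(mtcars[, 1:7]))
## order variables by their loading on the first principal component
ord <- order(prcomp(X)$rotation[, 1])
ggparcoord(X[, ord])
```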